Abstract

This  paper  presents  some  formal  results  on  learning.  In  particular,  it  concerns  algorithms  that  learn 
sets  and  functions  from  examples.  We  seek  conditions  necessary  and  sufficient  for  learning  over  a  range 
of  probabilistic  models  for  such  algorithms. 
1. Introduction

This paper concerns algorithms that learn sets and functions from examples for them. The results presented in this paper appeared in preliminary form in (Natarajan, 1986; 1988). The motivation behind the study is a need to better understand the class of problems known as "concept learning problems" in the Artificial Intelligence literature.

What follows is a brief definition of concept (or set) learning. Let $\Sigma$ be the $\{0,1\}$ alphabet, $\Sigma^*$ the set of all strings on $\Sigma$, and for any positive integer $n$, $\Sigma^n$ the set of strings on $\Sigma$ of length $n$. Let $f$ denote a subset of $\Sigma^*$ and $F$ a set of such subsets. An example for $f$ is a pair $(x, y)$, $x \in \Sigma^*$, $y \in \Sigma$, such that $x \in f$ iff $y = 1$. Informally, a learning algorithm for $F$ is an algorithm that does the following: given a sufficiently large number of randomly chosen examples for any set $f \in F$, the algorithm identifies a set $g \in F$ such that $g$ is a good approximation of $f$. (These notions will be formalized later.) The primary aim of this paper is to study the relationship between the properties of $F$ and the number of examples necessary and sufficient for any learning algorithm for it.

To place this paper in perspective: There are numerous papers on the concept learning problem in the artificial intelligence literature. See [Michalski et al., 1983] for an excellent review. Much of this work is not formal in approach. On the other hand, many formal studies of related problems were reported in the inductive inference literature. See [Angluin & Smith, 1983] for an excellent review. As it happened, the wide gap between the basic assumptions of inductive inference on the one hand, and the needs of the empiricists on the other, did not permit the formal work significant practical import. More recently, [Valiant, 1984] introduced a new formal framework for the problem, with a view towards probabilistic analysis. The framework appears to be of both theoretical and practical interest, and the results of this paper are based on it and its variants. Related results appear in [Angluin, 1987; Rivest & Schapire, 1987; Berman & Roos, 1987; Laird, 1986; Kearns et al., 1986] amongst others. [Blumer et al., 1986] present an independent development of some of the results presented in this paper, their proofs hinging on some classical results in probability theory, while ours are mostly combinatorial in flavour.

We begin by describing a formal model of learning, our variant of the model first presented by [Valiant, 1984]. Specifically, we define the notion of polynomial learnability of sets in Section 2. We then discuss the notion of asymptotic dimension of a family of concepts, and use it to obtain necessary and sufficient conditions for learnability. In doing so, we give a general learning algorithm that turns out to be surprisingly simple, though provably good. Section 3 deals with a slightly different learning model, one in which the learner is required to learn with one-sided error, i.e., his approximation to the set to be learned must be conservative in that it is a subset of the set to be learned. Section 4 deals with the time complexity of learning, identifying necessary and sufficient conditions for efficient learning. Section 5 generalizes the learning model to consider functions instead of sets. Notions of asymptotic learnability and asymptotic dimension are defined in this setting and necessary and sufficient conditions for learnability obtained. This requires us to prove a rather interesting combinatorial result called the generalized shattering lemma. Finally, Section 6 deals with a non-asymptotic model of learning, where the division is between finite and infinite, rather than on asymptotic behaviour. In particular, we consider learning sets and functions on the reals, introducing the notion of finite-learnability. We review the elegant results of [Blumer et al., 1986] on conditions necessary and sufficient for learnability in this setting. We then identify conditions necessary and sufficient for the finite-learnability of functions on the reals.




2. Feasible Learnability of Sets

We begin by describing our variant of the learning framework proposed by [Valiant, 1984].

Let $\Sigma$ be the binary alphabet $\{0,1\}$, $\Sigma^*$ the set of all strings on $\Sigma$, and for any positive integer $n$, let $\Sigma^{n-}$ be the set of strings of length $n$ or less in $\Sigma^*$. A concept¹ $f$ is any subset of $\Sigma^*$. Associated with each concept $f$ is the membership function $f : \Sigma^* \to \{0,1\}$, such that $f(x) = 1$ iff $x \in f$. We use $f$ to refer both to the function and to the set. An example for a concept $f$ is a pair $(x, y)$, $x \in \Sigma^*$, $y \in \{0,1\}$, such that $y = f(x)$. A family of concepts $F$ is any set of concepts on $\Sigma^*$. A learning algorithm (or more generally, a learning function) for the family $F$ is an algorithm that attempts to infer approximations to a concept in $F$ from examples for it. The algorithm has at its disposal a subroutine EXAMPLE, which when called returns a randomly chosen example for the concept to be learned. The example is chosen randomly according to an arbitrary and unknown probability distribution $P$ on $\Sigma^*$, in that the probability that a particular example $(x, f(x))$ will be produced at any call of EXAMPLE is $P(x)$.

Defn: Let $f$ be a concept and $n$ any positive integer. The projection $f_n$ of $f$ on $\Sigma^{n-}$ is given by $f_n = f \cap \Sigma^{n-}$.

Defn: Let $S$ be any set. A sequence on $S$ is simply a sequence of elements of $S$. $S^l$ denotes the set of all sequences of length $l$ on $S$, while $S^*$ denotes the set of all sequences of finite length on $S$.

Defn: Let $f$ be a concept on $\Sigma^*$ and $P$ a probability distribution on $\Sigma^*$. A sample of size $l$ for $f$ with respect to $P$ is a sequence of the form $(x_1, f(x_1)), (x_2, f(x_2)), \ldots, (x_l, f(x_l))$, where $x_1, x_2, \ldots, x_l$ is a sequence of elements of $\Sigma^*$, randomly and independently chosen according to $P$.

Defn: Let $f$ and $g$ be any two sets. The symmetric difference of $f$ and $g$, denoted by $f \Delta g$, is defined by $f \Delta g = (f - g) \cup (g - f)$.
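The definitions of a sample and of the symmetric difference can be made concrete with a small sketch. The toy domain, the uniform distribution, and the names `symmetric_difference`, `error`, and `draw_sample` are illustrative assumptions, not part of the paper.

```python
import random

def symmetric_difference(f, g):
    """Return f Δ g = (f - g) ∪ (g - f) for two sets of strings."""
    return (f - g) | (g - f)

def error(f, g, P):
    """Probability mass of f Δ g under P (a dict mapping string -> probability)."""
    return sum(P.get(x, 0.0) for x in symmetric_difference(f, g))

def draw_sample(f, P, size, rng):
    """Draw `size` examples (x, f(x)), independently according to P."""
    strings = list(P)
    weights = [P[x] for x in strings]
    xs = rng.choices(strings, weights=weights, k=size)
    return [(x, 1 if x in f else 0) for x in xs]

# Hypothetical toy instance: all binary strings of length 2 or less.
domain = ["", "0", "1", "00", "01", "10", "11"]
P = {x: 1 / len(domain) for x in domain}   # uniform distribution (assumption)
f = {"0", "00", "01"}                      # concept to be learned
g = {"0", "00", "10"}                      # a candidate approximation
sample = draw_sample(f, P, 5, random.Random(0))
```

Here `error(f, g, P)` is the quantity that the learnability definition later bounds by $1/h$.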

With these supporting definitions in hand, we present our main definition. Intuitively, we will call a family $F$ feasibly learnable if it can be learned from polynomially few examples, polynomial in an error parameter $h$ and a length parameter $n$. The length parameter $n$ controls the length of the strings the concept is to be approximated on, and the error parameter $h$ controls the error allowed in the learnt approximation.

Defn: Formally, a family $F$ is feasibly learnable if there exists an algorithm² $A$ such that

(a) $A$ takes as input two integers $n$ and $h$, where $n$ is the size parameter, and $h$ is the error parameter.

(b) $A$ makes polynomially few calls of EXAMPLE, polynomial in $n$ and $h$. EXAMPLE returns examples for some $f \in F$, where the examples are chosen randomly and independently according to an arbitrary and unknown probability distribution $P$ on $\Sigma^{n-}$.

(c) For all concepts $f \in F$ and all probability distributions $P$ on $\Sigma^{n-}$, with probability $(1 - 1/h)$, $A$ outputs a concept $g \in F$ such that

$$\sum_{x \in f \Delta g} P(x) \le 1/h.$$

¹We use the term concept instead of set to conform with the artificial intelligence literature.

²Unless stated otherwise, by "algorithm" we mean a finitely representable procedure, not necessarily computable. That is, the procedure might use well-defined but non-computable functions as primitives.

Defn: Let $N$ be the set of natural numbers. The learning function $\Psi : N \times N \times (\Sigma^* \times \{0,1\})^* \to F$ associated with a learning algorithm $A$ is defined as follows.

Learning Function $\Psi$

input: n, h: integers; C: sample;
begin
    let C = (x_1, y_1), (x_2, y_2), ...;
    run A on inputs n, h;
    in place of EXAMPLE, at the i-th call of EXAMPLE by A,
        give A (x_i, y_i) as example;
    output A's output.
end


We now introduce a measure called the dimension for a family of concepts. Recall that we defined the projection $f_n$ of $f$ on $\Sigma^{n-}$ by $f_n = f \cap \Sigma^{n-}$. Similarly, the projection $F_n$ of the family $F$ on $\Sigma^{n-}$ is given by $F_n = \{ f_n \mid f \in F \}$. We call $F_n$ the $n$-th subfamily of $F$.

Defn: The dimension of a subfamily $F_n$, denoted by $\dim(F_n)$, is defined by
$$\dim(F_n) = \log_2(|F_n|).$$

(Notation: For a set $X$, $|X|$ denotes the cardinality, while for a string $x$, $|x|$ denotes the string length.)

Defn: Let $d : N \to N$ be a function of one variable, where $N$ is the natural numbers. The asymptotic dimension (or more simply the dimension) of a family $F$ is $d(n)$ if $\dim(F_n) = \Theta(d(n))$. That is, there exists a constant $c$ such that
$$\forall n : \dim(F_n) \le d(n)$$
and $\dim(F_n) \ge c\,d(n)$ infinitely often.

We denote the asymptotic dimension of a family $F$ by $\dim(F)$. We say a family $F$ is of polynomial dimension if the asymptotic dimension of $F$ is a polynomial in $n$.
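As a rough illustration of the dimension of a subfamily, the sketch below computes $\log_2$ of the number of distinct projections for a small family. The "prefix" family used here is a hypothetical example chosen for this sketch, as are the helper names.

```python
from math import log2

def strings_up_to(n):
    """All binary strings of length n or less, including the empty string."""
    out, frontier = [""], [""]
    for _ in range(n):
        frontier = [s + b for s in frontier for b in "01"]
        out.extend(frontier)
    return out

def project(f, n):
    """Projection of concept f on the strings of length n or less."""
    return frozenset(x for x in f if len(x) <= n)

def dim(family, n):
    """log2 of the number of distinct projections of the family."""
    return log2(len({project(f, n) for f in family}))

# Hypothetical family: for each prefix p of length <= 3, the concept
# "all strings of length <= 3 that start with p".
universe = strings_up_to(3)
family = [frozenset(x for x in universe if x.startswith(p)) for p in universe]

d1 = dim(family, 1)   # projections: {"", "0", "1"}, {"0"}, {"1"}, and the empty set
d3 = dim(family, 3)   # all 15 concepts remain distinct
```

Distinct concepts may collapse to the same projection, which is why the dimension is measured on $F_n$ rather than on $F$ itself.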

With these definitions in hand, we can give our first result. The result is a lemma concerning the notion of shattering. Let $F$ be a family of subsets of a set $X$. We say that $F$ shatters a set $S \subseteq X$ if for every $S_1 \subseteq S$, there exists $f \in F$ such that $f \cap S = S_1$. To our knowledge, this notion was first introduced by [Vapnik & Chervonenkis, 1971].
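Shattering can be checked by brute force on small instances. The following sketch is only illustrative: the family of integer intervals and the function names are assumptions made for the example.

```python
from itertools import combinations

def shatters(family, S):
    """True if every subset of S arises as f ∩ S for some f in the family."""
    S = frozenset(S)
    traces = {frozenset(f) & S for f in family}
    return len(traces) == 2 ** len(S)

def largest_shattered(family, X):
    """Size of the largest subset of X shattered by the family.

    Every subset of a shattered set is itself shattered, so the search
    can stop at the first size for which no set is shattered.
    """
    k = 0
    while k < len(X) and any(shatters(family, S) for S in combinations(X, k + 1)):
        k += 1
    return k

# Hypothetical family: all intervals {i, ..., j-1} on X = {1, 2, 3, 4},
# including the empty interval.
X = [1, 2, 3, 4]
intervals = [frozenset(range(i, j)) for i in range(1, 6) for j in range(i, 6)]
k = largest_shattered(intervals, X)
```

Intervals shatter any pair of points but no triple, since no interval contains the two outer points while excluding the middle one.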

We can now state our first result.

Lemma 1 (Shattering Lemma): If $F_n$ is of dimension $d$, then $F_n$ shatters a set of size ceiling($d/(n+2)$).³ Also, every set shattered by $F_n$ is of size at most $d$.

³ceiling($r$) is the least integer greater than or equal to $r$.
Proof: First, we prove the upper bound. Suppose a set $S$ is shattered by $F_n$. Since there are $2^{|S|}$ distinct subsets of $S$, it follows from the definition of shattering that $2^{|S|} \le |F_n|$. Taking logarithms on both sides of the inequality, we get $|S| \le \log_2(|F_n|) = d$, which is as desired. To see that the upper bound can be attained, simply let $F_n$ be all possible subsets of some $d$ strings in $\Sigma^{n-}$.

We prove the lower bound part of the lemma through the following claim. A variant of the claim is given by Vapnik & Chervonenkis (1971) amongst others.

Claim: Let $X$ be any finite set and let $H$ be a set of subsets of $X$. If $k$ is the size of the largest subset of $X$ shattered by $H$, then

$$|H| \le (|X|+1)^k.$$

Proof: By induction on $|X|$, the size of $X$.

Basis: Clearly true for $|X| = 1$.

Induction: Assume the claim holds for $|X| = m$ and prove true for $m+1$. Let $|X| = m+1$ and let $H$ be any set of subsets of $X$. Also, let $k$ be the size of the largest subset of $X$ shattered by $H$. Pick any $x \in X$ and partition $X$ into two sets $\{x\}$ and $Y = X - \{x\}$. Define $H_1$ to be the set of all sets in $H$ that are reflected about $x$. That is, for each set $A_1$ in $H_1$ there exists a set $A \in H$ such that $A$ differs from $A_1$ only in that $A$ does not include $x$. Formally,

$$H_1 = \{ A_1 \mid A_1 \in H,\ \exists A \in H,\ A \ne A_1 \text{ and } A_1 = A \cup \{x\} \}.$$

Now define $H_2 = H - H_1$. Surely, the sets of $H_2$ can be distinguished on the elements of $Y$. That is, no two sets of $H_2$ can differ only on $x$, by virtue of our definition of $H_1$. Hence, we can consider $H_2$ as sets defined on $Y$. Surely, $H_2$ cannot shatter a set larger than the largest set shattered by $H$. Hence, $H_2$ shatters a set no bigger than $k$. Since $|Y| \le m$, by the inductive hypothesis we have $|H_2| \le (|Y|+1)^k$.

Now consider $H_1$. By definition, the sets of $H_1$ are all distinct on $Y$. That is, for any two distinct sets $A_1, A_2$ in $H_1$, $A_1 \cap Y \ne A_2 \cap Y$. Suppose $H_1$ shattered a set $S \subseteq Y$, $|S| \ge k$. Then $H$ would shatter $S \cup \{x\}$. But $|S \cup \{x\}| \ge k+1$, which is impossible by assumption. Hence, $H_1$ shatters a set of at most $(k-1)$ elements in $Y$. By the inductive hypothesis, we have
$$|H_1| \le (|Y|+1)^{k-1}.$$

Combining the two bounds, we have
$$|H| = |H - H_1| + |H_1| = |H_2| + |H_1|$$
$$\le (|Y|+1)^k + (|Y|+1)^{k-1} \le (m+1)^k + (m+1)^{k-1}$$
$$\le (m+1)^{k-1}(m+2) \le (m+2)^k \le (|X|+1)^k.$$

Thus the claim is proved. •

Returning to the lemma, we see that if $X$ is all strings of length $n$ or less on the binary alphabet, $|X| \le 2^{n+1}$. By our claim, if the largest set shattered by $F_n$ is of size $k$,
$$|F_n| \le (2^{n+1}+1)^k,$$
so that
$$k \ge \log(|F_n|)/\log(2^{n+1}+1) \ge \dim(F_n)/(n+2).$$

Since $k$ must be an integer, we take the ceiling of the right-hand side of the last inequality. This completes the proof of the lemma. •
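The bound $|H| \le (|X|+1)^k$ from the claim can be spot-checked numerically. The sketch below samples random families on a five-element set; the random sampling scheme and all names are assumptions of the sketch, and a finite check of this kind is of course not a proof.

```python
import random
from itertools import combinations

def shatters(H, S):
    """True if the sets in H induce all 2^|S| subsets of S."""
    S = frozenset(S)
    return len({A & S for A in H}) == 2 ** len(S)

def largest_shattered(H, X):
    """Size k of the largest subset of X shattered by H."""
    k = 0
    while k < len(X) and any(shatters(H, S) for S in combinations(X, k + 1)):
        k += 1
    return k

def claim_holds(H, X):
    """Check |H| <= (|X| + 1)^k for the family H on X."""
    return len(H) <= (len(X) + 1) ** largest_shattered(H, X)

rng = random.Random(1)
X = list(range(5))
all_subsets = [frozenset(c) for r in range(len(X) + 1) for c in combinations(X, r)]

# Test the claim on 200 randomly chosen families of subsets of X.
results = [
    claim_holds(set(rng.sample(all_subsets, rng.randint(1, 32))), X)
    for _ in range(200)
]
```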

We can now use this lemma to prove the main theorem of this section.

Theorem 1: A family $F$ of concepts is feasibly learnable if and only if it is of polynomial dimension.

Proof: (If) Let $F$ be of dimension $d(n)$. The following is a learning algorithm for $F$, satisfying the requirements of our definition of learnability.

Learning_Algorithm $A_1$

input: n, h
begin
    call EXAMPLE h(dim(F_n) ln(2) + ln(h)) times,
    let S be the set of examples seen,
    pick any concept g in F consistent with S,
    output g.
end
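For a finite family given explicitly, the algorithm above can be sketched as follows. This is a minimal illustration under stated assumptions: the family is a short explicit list, EXAMPLE is modelled by a Python callable, and the names `sample_size` and `learn_consistent` are hypothetical.

```python
import math
import random

def sample_size(d, h):
    """Number of calls of EXAMPLE: h (d ln 2 + ln h), rounded up."""
    return math.ceil(h * (d * math.log(2) + math.log(h)))

def learn_consistent(family, example, d, h):
    """Draw the sample, then return any concept consistent with it."""
    S = [example() for _ in range(sample_size(d, h))]
    for g in family:
        if all((x in g) == (y == 1) for x, y in S):
            return g
    return None  # unreachable when the target concept is in the family

# Hypothetical toy instance on binary strings of length 2 or less.
domain = ["", "0", "1", "00", "01", "10", "11"]
family = [frozenset(), frozenset({"0"}), frozenset({"0", "00"}), frozenset(domain)]
target = family[2]
rng = random.Random(0)

def example():
    """Stand-in for EXAMPLE: uniform distribution on the domain (assumption)."""
    x = rng.choice(domain)
    return x, 1 if x in target else 0

g = learn_consistent(family, example, d=math.log2(len(family)), h=4)
```

The linear scan over the family is only workable for tiny examples; the paper's algorithm is allowed to be non-computable, so no efficiency is implied.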


We need to show that $A_1$ does indeed satisfy our requirements. Note that $A_1$ may not be computable, but, as noted earlier, this is not a difficulty. Let $f$ be the concept to be learned. Since $P$ is a distribution on $\Sigma^{n-}$, EXAMPLE returns examples of $f_n$. We require that with high probability, $A_1$ should output a concept $g \in F$ such that the probability that $f$ and $g$ differ is less than $(1/h)$. Let $C_n(f)$ be all concepts in $F_n$ that differ from $f_n$ with probability greater than $1/h$. By definition, for any particular $g$ such that $g_n \in C_n(f)$, the probability that any call of EXAMPLE will produce an example consistent with $g$ is bounded by $(1-1/h)$. Hence, the probability that $m$ calls of EXAMPLE will produce examples all consistent with $g$ is bounded by $(1-1/h)^m$. And hence, the probability that $m$ calls of EXAMPLE will produce examples all consistent with any $g_n \in C_n(f)$ is bounded by $|C_n(f)|(1-1/h)^m$. We wish to make $m$ sufficiently large to bound this probability by $1/h$:

$$|C_n(f)|(1-1/h)^m \le 1/h.$$

But surely, $|C_n(f)| \le |F_n| \le 2^{d(n)}$.

Hence, we want

$$2^{d(n)}(1-1/h)^m \le 1/h.$$

Taking natural logarithms on both sides of the inequality, we get
$$d(n)\ln(2) + m \ln(1-1/h) \le \ln(1/h),$$
that is,
$$-m \ln(1-1/h) \ge d(n)\ln(2) + \ln(h).$$

Since $\ln(1-1/h) \le -1/h$, it suffices that
$$-m(-1/h) \ge d(n)\ln(2) + \ln(h),$$
or
$$m \ge h(d(n)\ln(2) + \ln(h)).$$

Hence, if $h(d(n)\ln(2)+\ln(h))$ examples are drawn, the probability that all the examples seen are consistent with a concept that differs from the true concept by $1/h$ or more is bounded by $1/h$. Since $A_1$ draws as many examples and outputs a concept consistent with the examples seen, with probability $1-1/h$, $A_1$ will output a concept that differs from the true concept with probability less than $1/h$. Hence, $A_1$ does satisfy our requirements. Clearly, if $d(n)$ is a polynomial in $n$, the number of examples called by $A_1$ is polynomial in $n$, $h$ and hence $F$ is feasibly learnable.
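The last step of the derivation can be checked numerically: with $m = h(d\ln 2 + \ln h)$ rounded up, the quantity $2^{d}(1-1/h)^m$ should not exceed $1/h$. The sketch below works with logarithms to avoid overflow; the grid of values for $d$ and $h$ is an arbitrary choice made for illustration.

```python
import math

def sample_size(d, h):
    """m = h (d ln 2 + ln h), rounded up to an integer."""
    return math.ceil(h * (d * math.log(2) + math.log(h)))

def log_failure_bound(d, h, m):
    """Natural logarithm of 2^d (1 - 1/h)^m."""
    return d * math.log(2) + m * math.log1p(-1.0 / h)

# Verify 2^d (1 - 1/h)^m <= 1/h on a grid of dimensions and error parameters.
checks = [
    log_failure_bound(d, h, sample_size(d, h)) <= -math.log(h)
    for d in range(1, 60)
    for h in (2, 5, 10, 100, 1000)
]
```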

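(only if)

Now suppose that $F$ is of super-polynomial dimension $d(n)$ and yet $F$ were feasibly learnable by an algorithm $A$ from $(nh)^k$ examples, for some fixed $k$. Let $\Psi$ be the learning function corresponding to $A$. Now pick $n$ and $h \ge 5$ such that
$$\dim(F_n) \ge 2(n+2)(nh)^k.$$

By the shattering lemma, there exists a set $S \subseteq \Sigma^{n-}$ such that $|S| \ge \dim(F_n)/(n+2)$, and $S$ is shattered by $F_n$.

Let $X^l$ denote the sequence $x_1, \ldots, x_l$ and let $f \in F_n$. Define the operator $\delta$ as follows.

$$\delta(f, X^l, \Psi) = \sum_{x \in f \Delta g} P(x)$$

where $g = \Psi(n, h, (x_1, f(x_1)), (x_2, f(x_2)), \ldots, (x_l, f(x_l)))$.

In words, $\delta(f, X^l, \Psi)$ is the probability error in the concept output by $A$ on seeing the sample $(x_i, f(x_i))$ for $f$. Let $G_n \subseteq F_n$ be such that for each $S_1 \subseteq S$, there is exactly one $g \in G_n$ such that $g \cap S = S_1$. Such $G_n$ must exist as $F_n$ shatters $S$. Let $P$ be the probability distribution that is uniform on $S$ and zero elsewhere.

Claim: Let $l = (nh)^k$. Then for each $f \in G_n$ and $X^l \in S^l$, there exists a unique $g \in G_n$ with $f \Delta g = S - \{X^l\}$ on $S$, and at most one of $\delta(f, X^l, \Psi) \le 1/h$ and $\delta(g, X^l, \Psi) \le 1/h$ holds.

Proof: Let $\{X^l\}$ denote the set of strings occurring in $X^l$, i.e., $\{X^l\} = \{x \mid x \text{ occurs in } X^l\}$. By the definition of $G_n$, for each $f$, $X^l$, there exists a unique $g \in G_n$ such that $f \Delta g = S - \{X^l\}$ on $S$. Since $f$ and $g$ agree on $\{X^l\}$, $\Psi$ sees the same sample for both and outputs the same concept, which must differ from at least one of $f$, $g$ on each element of $S - \{X^l\}$. Hence,

$$\delta(f, X^l, \Psi) + \delta(g, X^l, \Psi) \ge \sum_{x \in S - \{X^l\}} P(x) \ge 1/2.$$

The last step follows from the fact that $\{X^l\}$ has at most half as many elements as $S$, and $P$ is uniform on $S$. Since $h \ge 5$, $1/h \le 1/5$, and at most one of the terms on the left can be smaller than $1/5$ if the inequality is to hold. Hence the claim. •

Since $\Psi$ is a learning function for $F$, for each $f \in G_n$,
$$\Pr\{\delta(f, X^l, \Psi) \le 1/h\} \ge (1 - 1/h).$$

(Notation: $\Pr\{Y\}$ denotes the probability of event $Y$.)

Define the switch function $\theta : \{\text{true}, \text{false}\} \to N$ as follows. For any boolean-valued predicate $Q$,

$$\theta(Q) = \begin{cases} 1, & \text{if } Q \text{ is true} \\ 0, & \text{otherwise.} \end{cases}$$

Now write

$$\Pr\{\delta(f, X^l, \Psi) \le 1/h\} = \sum_{X^l \in S^l} \theta(\delta(f, X^l, \Psi) \le 1/h) \Pr\{X^l\}.$$

Substituting the above in the last inequality, we get
$$\sum_{X^l \in S^l} \theta(\delta(f, X^l, \Psi) \le 1/h) \Pr\{X^l\} \ge (1 - 1/h).$$

Summing over $G_n$,
$$\sum_{f \in G_n} \sum_{X^l \in S^l} \theta(\delta(f, X^l, \Psi) \le 1/h) \Pr\{X^l\} \ge \sum_{f \in G_n} (1 - 1/h).$$

Flipping the order of the sums,
$$\sum_{X^l \in S^l} \sum_{f \in G_n} \theta(\delta(f, X^l, \Psi) \le 1/h) \Pr\{X^l\} \ge \sum_{f \in G_n} (1 - 1/h).$$

By the last Claim, for each $X^l$ at most half of the $f \in G_n$ satisfy $\delta(f, X^l, \Psi) \le 1/h$, so
$$\sum_{f \in G_n} \theta(\delta(f, X^l, \Psi) \le 1/h) \Pr\{X^l\} \le \sum_{f \in G_n} (1/2) \Pr\{X^l\}.$$

Hence, we have
$$\sum_{X^l \in S^l} \sum_{f \in G_n} (1/2) \Pr\{X^l\} \ge \sum_{f \in G_n} (1 - 1/h).$$

Flipping the order of the sums again,
$$\sum_{f \in G_n} \sum_{X^l \in S^l} (1/2) \Pr\{X^l\} \ge \sum_{f \in G_n} (1 - 1/h).$$

Which reduces to
$$\sum_{f \in G_n} (1/2) \ge \sum_{f \in G_n} (1 - 1/h),$$
which is impossible as $h \ge 5$.

The last contradiction implies that $A$ cannot be a learning algorithm for $F$ as supposed and hence the result.

This completes the proof. •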
3. Learning Sets with One-Sided Error

We now consider a learning framework in which the learner is only allowed to see positive examples for the concept to be learned, and is required to be conservative in his approximation in that the concept output by the learner must be a subset of the concept to be learnt. Historically, this was the framework first studied by [Valiant, 1984].

Let $F$ be the family of concepts to be learned. EXAMPLE produces positive examples for some concept $f \in F$. Specifically, EXAMPLE produces a string $x \in f$. Let $P$ be a probability distribution on $\Sigma^*$. The probability that a string $x \in f$ is produced by any call of EXAMPLE is the conditional probability given by

$$\frac{P(x)}{\sum_{x' \in f} P(x')},$$

assuming the denominator is non-zero. If the denominator is zero, EXAMPLE never produces any examples. We can now define learnability as we did earlier.

Defn: A family of concepts $F$ is feasibly learnable with one-sided error if there exists an algorithm $A$ such that

(a) $A$ takes as inputs integers $n$ and $h$, where $n$ is the size parameter and $h$ the error parameter.

(b) $A$ makes polynomially few calls of EXAMPLE, polynomial in $n$ and $h$. EXAMPLE returns positive examples for some concept $f \in F$, chosen according to an arbitrary and unknown probability distribution $P$ on $\Sigma^{n-}$.

(c) For all concepts $f \in F$ and all probability distributions $P$ on $\Sigma^{n-}$, with probability $(1-1/h)$, $A$ outputs $g \in F$ such that $g_n \subseteq f_n$ and

$$\sum_{x \in f \Delta g} P(x) \le 1/h.$$

Defn: We say a family of concepts $F$ is well-ordered if for all $n$, $F_n \cup \{\emptyset\}$ is closed under intersection.

With these definitions in hand, we state and prove the following theorem.

Theorem 2: A family $F$ of concepts is feasibly learnable with one-sided error if and only if it is of polynomial dimension and is well-ordered.

Proof: (If) This direction of the proof begins with the following claim.

Claim: Let $S \subseteq \Sigma^{n-}$ be any non-empty set such that there exists a concept $g \in F_n$ containing $S$, i.e., $g \in F_n$ and $S \subseteq g$. If $F$ is well-ordered, there exists a least concept $f$ in $F_n$ containing $S$, i.e.,

$$\forall g \in F_n : S \subseteq g \text{ implies } f \subseteq g.$$

Proof: Let $S \subseteq \Sigma^{n-}$ be non-empty and let $\{f_1, f_2, \ldots\}$ be the set of concepts in $F_n$ containing $S$. Now the intersection of all these concepts, $f = f_1 \cap f_2 \cap \cdots$, is in $F_n$. To see this, notice that since $F_n \cup \{\emptyset\}$ is closed under intersection, $f \in F_n \cup \{\emptyset\}$. But $f \ne \emptyset$, as $S \ne \emptyset$ and $S \subseteq f$. Hence, $f \in F_n$. •
This allows us to write the following learning algorithm for $F$.

Learning_Algorithm $A_2$

input: n, h
begin
    call EXAMPLE h(d(n) ln(2) + ln(h)) times,
    let S be the set of examples seen,
    output any g in F such that g_n is the least
        concept in F_n containing S.
end
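For a small well-ordered family, the least concept containing a set of positive examples can be computed by intersecting all concepts that contain it, as in the claim above. The family below and the function names are hypothetical, chosen only to illustrate the step.

```python
from functools import reduce

def closed_under_intersection(family):
    """Check that the family together with the empty set is closed under ∩."""
    fam = set(family) | {frozenset()}
    return all((a & b) in fam for a in fam for b in fam)

def least_concept(family, S):
    """Intersection of all concepts containing S, or None if there are none."""
    S = frozenset(S)
    containing = [f for f in family if S <= f]
    if not containing:
        return None
    return reduce(frozenset.__and__, containing)

# Hypothetical well-ordered family on binary strings of length 2 or less.
family = [
    frozenset({"0"}),
    frozenset({"0", "00"}),
    frozenset({"0", "01"}),
    frozenset({"0", "00", "01"}),
    frozenset({"1"}),
]

g = least_concept(family, {"00"})   # least concept containing the example "00"
```

Because the output is contained in every concept consistent with the positive examples, it is in particular a subset of the concept being learned, which is the one-sided requirement.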


Let $f$ be the concept to be learned. Since $g_n$ is the least concept consistent with $S$, surely $g_n \subseteq f_n$. Using arguments identical to those used in our proof of Theorem 1, we can show that with probability greater than $(1-1/h)$, $g$ will not differ from the concept to be learned with probability greater than $1/h$. This completes the "if" direction of our proof.

(only if) Let $F$ be feasibly learnable with one-sided error by an algorithm $A$. Let us show that $F$ is well-ordered, i.e., for all $n$, $F_n \cup \{\emptyset\}$ is closed under intersection. Suppose for some $n$, $F_n \cup \{\emptyset\}$ were not closed under intersection, and that $f$, $g$ were two concepts in $F_n \cup \{\emptyset\}$ such that $f \cap g$ is not in $F_n \cup \{\emptyset\}$. Now, surely $f \cap g \ne \emptyset$, and hence $f \cap g$ is not in $F_n$. Place the probability distribution that is uniform on $f \cap g$ and zero elsewhere on $\Sigma^{n-}$, and run the learning algorithm $A$ for $h = 2^{n+1}$. At each call of EXAMPLE, a randomly chosen element of $f \cap g$ will be returned. Since $f \cap g$ is not in $F_n$, $A$ must fail to learn with one-sided error. To see this, suppose that $A$ outputs some concept $e \in F$. Now, since $A$ claims to learn with one-sided error, $e_n \subseteq f$, if $f$ were the concept to be learned. Similarly, $e_n \subseteq g$, since $g$ could well be the concept to be learned. Hence, $e_n \subseteq f \cap g$. But since $1/h = 1/2^{n+1}$ is smaller than the probability of any single element of $f \cap g$, $e_n$ must be $f \cap g$, which contradicts the assumption that $f \cap g$ is not in $F_n$. By arguments similar to those of our proof of Theorem 1, we can show that $F$ must be of polynomial dimension. An alternate proof is presented in [Natarajan, 1986]. Hence the claim. •

This completes the proof. •

We now exhibit a curious property of the well-ordered families. Specifically, we show that each concept (except the empty set) in a well-ordered family has a short and unique "signature".

For a well-ordered family $F$, define the operator $M_n$ as follows.

$M_n(S)$ = the least $f \in F_n$ such that $S \subseteq f$, if such $f$ exists, and is undefined otherwise.

In words, $M_n(S)$ is simply the least set in $F_n$ consistent with $S$.

Proposition 1: $M_n$ is idempotent, i.e., $M_n(M_n(S)) = M_n(S)$.

Proof: By the definition of $M_n$, $M_n(S)$ is the least concept $f \in F_n$ such that $S \subseteq f$. Surely, $M_n(f) = f$ and hence the proposition. •


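Proposition 2: For $A \subseteq B \subseteq \Sigma^{n-}$, if $M_n(A)$ and $M_n(B)$ are both defined, then
$$M_n(A) \subseteq M_n(B).$$

Proof: By the definition of $M_n$, $B \subseteq M_n(B)$. Since $A \subseteq B$, $A \subseteq M_n(B)$. Hence, $M_n(A) \subseteq M_n(B)$, since $M_n(A)$ is the least concept in $F_n$ containing $A$. •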


Proposition 3: For $A, B \subseteq \Sigma^{n-}$, if $M_n(A)$ and $M_n(B)$ are defined,

$$M_n(A \cup B) = M_n(M_n(A) \cup M_n(B)).$$

Proof: Since $A \subseteq M_n(A)$ and $B \subseteq M_n(B)$, $A \cup B \subseteq M_n(A) \cup M_n(B)$. Whence it follows from Proposition 2 that
$$M_n(A \cup B) \subseteq M_n(M_n(A) \cup M_n(B)).$$
And then, since $A \subseteq A \cup B$, we have by Proposition 2
$$M_n(A) \subseteq M_n(A \cup B)$$
and similarly
$$M_n(B) \subseteq M_n(A \cup B).$$

Hence,

$$M_n(A) \cup M_n(B) \subseteq M_n(A \cup B).$$

Applying Proposition 2 again, we get
$$M_n(M_n(A) \cup M_n(B)) \subseteq M_n(M_n(A \cup B)).$$

Applying Proposition 1 to the right-hand side,

$$M_n(M_n(A) \cup M_n(B)) \subseteq M_n(A \cup B).$$

Hence, the proposition. •
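Propositions 1 to 3 can be checked exhaustively on a small intersection-closed family. In the sketch below the family of integer intervals stands in for a well-ordered $F_n$; this choice, and the function name `M`, are assumptions made purely for illustration.

```python
from itertools import combinations

# Hypothetical well-ordered family: all non-empty intervals on {0, ..., 4}.
X = range(5)
family = [frozenset(range(i, j + 1)) for i in X for j in X if i <= j]

def M(S):
    """Least concept in the family containing S, or None if undefined."""
    S = frozenset(S)
    containing = [f for f in family if S <= f]
    if not S or not containing:
        return None
    least = frozenset.intersection(*containing)
    return least if least in family else None

# All non-empty subsets of X.
subsets = [frozenset(c) for r in range(1, 6) for c in combinations(X, r)]

# Proposition 1: idempotence.
prop1 = all(M(M(S)) == M(S) for S in subsets)
# Proposition 2: monotonicity.
prop2 = all(M(A) <= M(B) for A in subsets for B in subsets if A <= B)
# Proposition 3: M(A ∪ B) = M(M(A) ∪ M(B)).
prop3 = all(M(A | B) == M(M(A) | M(B)) for A in subsets for B in subsets)
```

For intervals, `M(S)` is simply the interval from the smallest to the largest element of `S`.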

With these supporting propositions in hand, we can show that every concept in $F$ has a small "signature".

Proposition 4: If $F$ is well-ordered, then for every $f \in F_n$ there exists $S_f \subseteq \Sigma^{n-}$, $|S_f| \le \dim(F_n)$, such that $f = M_n(S_f)$.

Proof: Let $f \in F_n$ and let $S_f$ be a set of minimum size such that $f = M_n(S_f)$. Consider any two distinct subsets $S_1$, $S_2$ of $S_f$. We claim that $M_n(S_1) \ne M_n(S_2)$. To prove this, we will assume the contrary and arrive at a contradiction. Suppose for $S_1 \ne S_2$, $M_n(S_1) = M_n(S_2)$. Without loss of generality, assume $|S_1| < |S_2|$.

Now,
$$S_f = (S_f - S_2) \cup S_2.$$
Applying $M_n$ to both sides,
$$M_n(S_f) = M_n((S_f - S_2) \cup S_2).$$
Applying Proposition 3 to the right-hand side, we get
$$M_n(S_f) = M_n(M_n(S_f - S_2) \cup M_n(S_2)).$$
Since $M_n(S_2) = M_n(S_1)$,
$$M_n(S_f) = M_n(M_n(S_f - S_2) \cup M_n(S_1)).$$
Applying Proposition 3 again,
$$M_n(S_f) = f = M_n((S_f - S_2) \cup S_1).$$
But $|(S_f - S_2) \cup S_1| < |S_f|$, which contradicts our assumption that $S_f$ was a set of minimum size such that $f = M_n(S_f)$. Hence, each distinct subset of $S_f$ corresponds to a distinct $f \in F_n$. (Notice that we have really shown that $S_f$ is shattered by $F_n$.) Which in turn implies that
$$|F_n| \ge 2^{|S_f|}$$
or
$$\dim(F_n) \ge |S_f|.$$
Hence the proposition. •

Conversely, we can show that Proposition 4 is tight in the following sense.

Proposition 5: If $F$ is well-ordered, there exists $f \in F_n$ such that
$f = M_n(S)$ implies $|S| \ge \dim(F_n)/(n+1)$.

Proof: A simple counting argument. There are at most $2^{n+1}$ distinct examples. If every $f \in F_n$ were definable as the least concept containing some set of $d$ examples, then

$$2^{(n+1)d} \ge |F_n|,$$
or
$$(n+1)d \ge \dim(F_n), \text{ implying } d \ge \dim(F_n)/(n+1).$$

Hence, the proposition. •




4.  Time-Complexity  Issues  in  Learning  Sets 

Thus far, we concerned ourselves with the information complexity of learning, i.e., the number of examples required to learn. Another issue to be considered is the time-complexity of learning, i.e., the time required to process the examples. In order to permit interesting measures of time-complexity, we must specify the manner in which the learning algorithm identifies its approximation to the unknown concept. In particular, we will require the learning algorithm to output a name of its approximation in some predetermined naming scheme. To this end, we define the notion of an index for a family of concepts.

In order for each concept in a family $F$ to have a name of finite length, $F$ would have to be at most countably infinite. Assuming that the family $F$ is countably infinite, we define an index of $F$ to be a function $I : F \to 2^{\Sigma^*}$ such that
$$\forall f, g \in F,\ f \ne g \text{ implies } I(f) \cap I(g) = \emptyset.$$

For each $f \in F$, $I(f)$ is the set of indices for $f$.

We are primarily interested in families that can be learnt efficiently, i.e., in time polynomial in the input parameters $n$, $h$ and in the length of the shortest index for the concept to be learned. Analogous to our definition of learnability, we can now define polynomial-time learnability as follows. Essentially, a family is polynomial-time learnable if it is feasibly learnable by a polynomial-time algorithm.

Defn: A family of concepts $F$ is polynomial-time learnable in an index $I$ if there exists a deterministic learning algorithm $A$ such that

(a) $A$ takes as input integers $n$ and $h$.

(b) $A$ runs in time polynomial in the error parameter $h$, the length parameter $n$ and in the length of the shortest index in $I$ for the concept to be learned $f$. $A$ makes polynomially few calls of EXAMPLE, polynomial⁴ in $n$, $h$. EXAMPLE returns examples for $f$ chosen randomly according to an arbitrary and unknown probability distribution $P$ on $\Sigma^{n-}$.

(c) For all concepts $f$ in $F$ and all probability distributions $P$ on $\Sigma^{n-}$, with probability $(1-1/h)$ the algorithm outputs an index $i_g \in I(g)$ of a concept $g$ in $F$ such that

$$\sum_{x \in f \Delta g} P(x) \le 1/h.$$

We are interested in identifying the class of pairs $(F, I)$, where $F$ is a family of concepts and $I$ is an index for it, such that $F$ is polynomial-time learnable in $I$. To this end, we define the following.

Defn: For a family $F$ and index $I$, an ordering is a program that

(a) takes as input a set of examples $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i), \ldots\}$ such that $x_1, x_2, x_3, \ldots \in \Sigma^*$, and $y_1, y_2, \ldots \in \{0,1\}$.

(b) produces as output an index in $I$ of a concept $f \in F$ that is consistent with $S$, if such exists, i.e., outputs $i_f \in I(f)$ for some $f \in F$ such that

$$\forall (x, y) \in S,\ y = f(x).$$


Furthermore, if the ordering runs in time polynomial in the length of its input and the length of the shortest such index, we say it is a polynomial-time ordering and $F$ is polynomial-time orderable in $I$.

⁴Alternatively, we could permit $A$ to make as many calls of EXAMPLE as possible within its time bound. This will not change our discussion substantially. In the interest of clarity we will not pursue this alternative.

With these definitions in hand, we can state the following theorem.

Theorem 3: A family of concepts $F$ is polynomial-time learnable in an index $I$ (1) if it is of polynomial dimension and is polynomial-time orderable in $I$, and (2) only if $F$ is of polynomial dimension and is random polynomial-time orderable in $I$.⁵

Proof: (If) Let Q be a polynomial-time ordering for F in I. The following is a polynomial-time learning
algorithm for F in I.

Learning_Algorithm A3

input: n, h
begin

    call EXAMPLE h(dim(F_n) + log(h)) times;
    let S be the set of examples seen;
    output Q(S);

end
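The algorithm above can be sketched in Python as follows. This is only an illustration under stated assumptions: the toy family of threshold concepts f_t = {x : x >= t}, the helper names `threshold_ordering` and `make_example`, the uniform sampling distribution, and the choice of base-2 logarithms are all hypothetical and do not come from the paper.

```python
import math
import random


def learn(n, h, dim, example, ordering):
    """Sketch of A3: draw h * (dim(F_n) + log h) examples, then call the ordering."""
    m = math.ceil(h * (dim(n) + math.log2(h)))  # base-2 logarithm is an assumption
    sample = [example() for _ in range(m)]
    return ordering(sample)


def threshold_ordering(sample, n):
    """Hypothetical ordering for the toy family f_t = {x : x >= t}, 0 <= t <= 2**n.

    Returns the index t of a concept consistent with the sample, or None if
    no threshold concept is consistent with it.
    """
    lo = max((x + 1 for x, y in sample if y == 0), default=0)
    hi = min((x for x, y in sample if y == 1), default=2 ** n)
    return lo if lo <= hi else None


def make_example(target, n, rng):
    """Hypothetical EXAMPLE oracle: uniform x in [0, 2**n), labelled by f_target."""
    def example():
        x = rng.randrange(2 ** n)
        return x, int(x >= target)
    return example


if __name__ == "__main__":
    n, h = 7, 10
    oracle = make_example(target=37, n=n, rng=random.Random(0))
    # dim(F_n) = log |F_n|; this toy family has 2**n + 1 concepts.
    index = learn(n, h, lambda k: math.log2(2 ** k + 1), oracle,
                  lambda s: threshold_ordering(s, n))
    print("learned threshold index:", index)
```

The learner itself does no search; all of the work of finding a consistent concept is delegated to the ordering, which is why the running time of A3 is governed by that of Q.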


Given Theorem 1, we know that A3 learns F, and only need to bound its running time by a polynomial. Now,
Q runs in time polynomial in the size of its input and the length of the shortest index of any concept
consistent with S. Since the concept to be learned must be consistent with S, surely Q runs in time
polynomial in n, h and in the length of the shortest index of the concept to be learned. Hence, A3 runs
in time polynomial in n, h and in the length of the shortest index for the concept to be learned. Therefore,
F is polynomial-time learnable in I.

(Only if) Assume that F is polynomial-time learnable in an index I by an algorithm A. Since A calls for
polynomially few examples, F must be of polynomial dimension by Theorem 1. It remains to show that
there exists a randomized polynomial-time ordering for F. The following is such an ordering.

Ordering O

input: S: set of examples, n: integer;
begin

    place the uniform distribution on S;
    let h = |S| + 1;

    run A on inputs n, h, and
    on each call of EXAMPLE by A
    return a randomly chosen element of S;
    output the index output by A.
end
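A minimal Python sketch of this reduction is given below. The wrapper `ordering_from_learner` follows the pseudocode; the learner `threshold_learner` for the toy family f_t = {x : x >= t} is a hypothetical stand-in for A, and its sample size and base-2 logarithm are assumptions made only so that the example runs.

```python
import math
import random


def ordering_from_learner(sample, n, learner, rng):
    """Sketch of ordering O: run learner A with EXAMPLE simulated on S."""
    h = len(sample) + 1  # any h > |S| serves the argument

    def example():
        # EXAMPLE draws from the uniform distribution on S.
        return rng.choice(sample)

    return learner(n, h, example)


def threshold_learner(n, h, example):
    """Hypothetical learner A for the toy family f_t = {x : x >= t}."""
    m = math.ceil(h * (n + math.log2(h)))
    drawn = [example() for _ in range(m)]
    # Smallest threshold above every negative example that was drawn; it is
    # consistent with the drawn sample whenever some threshold concept is.
    return max((x + 1 for x, y in drawn if y == 0), default=0)


if __name__ == "__main__":
    S = [(3, 0), (12, 1), (9, 1), (5, 0)]
    print(ordering_from_learner(S, 4, threshold_learner, random.Random(0)))
```

Because the learner only sees random draws from S, the index it returns is consistent with all of S only with high probability, which is why the resulting ordering is randomized.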


Let f be a concept consistent with S, whose index length is the shortest over all such concepts. Now,
with probability (1 - 1/h), A must output the index of a concept g that agrees with f with probability at least
(1 - 1/h). Since the distribution is uniform and h > |S|, each example in S has probability at least
1/|S| > 1/h, so g must agree with f on every example in S.
Hence with high probability, g is consistent with S. Furthermore, since A is a polynomial-time learning
algorithm for F, our ordering O is a randomized polynomial-time ordering for F in I. To see this, notice
that A runs in time polynomial in n, h and l, the length of the shortest index of f. By our choice of h, it
follows that A runs in time polynomial in n, |S| and l. Hence, O runs in time polynomial in n, h and l, and is
a randomized polynomial-time ordering for F in I.

Footnote: A randomized algorithm is one that tosses coins during its computation and produces the correct answer with high probability.

This  completes  the  proof.  • 

We  can  state  analogous  results  on  the  time-complexity  of  learning  with  one-sided  error.  Specifically, 
an  ordering  for  a  well-ordered  family  would  be  an  ordering  as  defined  earlier  with  the  exception  that  it 
would  produce  the  least  concept  consistent  with  the  input.  Also,  we  can  modify  our  definition  of 
polynomial-time learnability to allow only one-sided error. We can then state and prove the following.

Theorem 4: A family F is polynomial-time learnable with one-sided error: (1) if it is of polynomial
dimension, well-ordered and possesses a polynomial-time ordering; (2) only if it is of polynomial
dimension, well-ordered and possesses a random polynomial-time ordering.

Proof: A straightforward extension of earlier proofs. •

5.  Learning Functions

In the foregoing, we were concerned with learning approximations to concepts or sets. In the more
general setting, one may consider learning functions from $\Sigma^{*}$ to $\Sigma^{*}$. To do so, we must first modify our
definitions suitably and generalize our formulation of the problem.


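Defn: We define a family of functions to be any set of functions from $\Sigma^{*}$ to $\Sigma^{*}$. For any $f \in F$, the
projection $f_n : \Sigma^{n} \to \Sigma^{n}$ of f on $\Sigma^{n}$ is given by

$$f_n(x) = \begin{cases} f(x), & \text{if } |f(x)| \le n \\ \text{the first } n \text{ characters of } f(x), & \text{otherwise} \end{cases}$$

Defn: The $n^{th}$-subfamily $F_n$ of F is the projection of F on $\Sigma^{n}$, i.e.,

$$F_n = \{ f_n \mid f \in F \}.$$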


The above two definitions are the analogues of the corresponding definitions for sets. The notion of
the projection $f_n$ of a function f attempts to capture the behaviour of f on strings of length n. If for some
$x \in \Sigma^{n}$, f(x) is not of length at most n, it is truncated to n characters.


An example for a function f is a pair (x, y), $x, y \in \Sigma^{*}$, such that y = f(x). A learning algorithm (or more
precisely a learning function) for a family of functions F is an algorithm that attempts to infer approximations
to functions in F from examples for them. The learning algorithm has at its disposal a subroutine EXAMPLE,
which at each call produces a randomly chosen example for the function to be learned. The examples
are chosen according to an arbitrary and unknown probability distribution P in that the probability that a
particular example (x, f(x)) will be produced at any call is P(x).

As in the case of sets, we define learnability as follows.

Defn: A family of functions F is feasibly learnable if there exists an algorithm A such that

(a) A takes as input integers n and h, where n is the size parameter and h the error parameter.

(b) A makes polynomially few calls of EXAMPLE, polynomial in n and h. EXAMPLE returns
examples for some function $f_n \in F_n$, chosen according to an arbitrary and unknown probability
distribution P on $\Sigma^{n}$.

(c) For all functions $f_n \in F_n$ and all probability distributions P on $\Sigma^{n}$, with probability (1 - 1/h), A
outputs a function $g \in F$ such that

$$\sum_{x:\ f_n(x) \ne g(x)} P(x) \le 1/h$$

Our  definition  of  dimension  in  this  setting  is  exactly  the  same  as  the  one  given  earlier  for  concepts. 
We  can  now  generalize  the  notion  of  shattering  as  follows. 

Defn: Let F be a family of functions from a set X to a set Y. We say F shatters a set $S \subseteq X$ if there
exist two functions $f, g \in F$ such that

(a) for any $s \in S$, $f(s) \ne g(s)$.

(b) for all $S_1 \subseteq S$, there exists $e \in F$ such that e agrees with f on $S_1$ and with g on $S - S_1$, i.e.,

$$\forall s \in S_1:\ e(s) = f(s),$$

$$\forall s \in S - S_1:\ e(s) = g(s).$$
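For finite X, the generalized notion of shattering can be checked by brute force. The sketch below is one way to do so, assuming each function is represented as a Python dict from points of X to values in Y; the names `shatters` and `largest_shattered` are illustrative and not taken from the paper.

```python
from itertools import combinations, product


def shatters(F, S):
    """True if the family F (a list of dicts X -> Y) shatters the tuple S."""
    S = tuple(S)
    realised = {tuple(e[s] for s in S) for e in F}
    for f, g in product(F, repeat=2):
        # Condition (a): f and g must differ on every point of S.
        if any(f[s] == g[s] for s in S):
            continue
        # Condition (b): every mix of f (on S_1) and g (on S - S_1)
        # must be realised by some e in F.
        if all(tuple(f[s] if in_s1 else g[s] for s, in_s1 in zip(S, mask)) in realised
               for mask in product((True, False), repeat=len(S))):
            return True
    return False


def largest_shattered(F, X):
    """Size of the largest subset of X shattered by F."""
    best = 0
    for k in range(1, len(X) + 1):
        if any(shatters(F, S) for S in combinations(X, k)):
            best = k
    return best


if __name__ == "__main__":
    all_fns = [dict(zip((0, 1), v)) for v in product((0, 1), repeat=2)]
    constants = [{0: 0, 1: 0}, {0: 1, 1: 1}]
    print(largest_shattered(all_fns, (0, 1)))
    print(largest_shattered(constants, (0, 1)))
```

When Y = {0, 1}, the pair f, g must be complementary on S, and the definition reduces to the usual notion of shattering for sets.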

We  can  now  generalize  our  shattering  lemma  for  functions  as  follows. 

Lemma 2 (Generalized Shattering Lemma): If $F_n$ is of dimension d, $F_n$ shatters a set of size
ceiling(d/(3n+3)). Also, every set shattered by $F_n$ is of size at most d.

Proof: The upper bound part of the lemma can be proved exactly as the corresponding part of
Lemma 1. To see that this upper bound can be attained, we simply need to consider a family $F_n$ of
{0,1}-valued functions.

The  lower  bound  part  of  the  lemma  is  proved  through  the  following  claim. 

Claim: Let X and Y be two finite sets and let H be a set of functions from X to Y. If k is the size of
the largest subset of X shattered by H, then

$$|H| \le (|X|)^{k} (|Y|)^{2k}.$$

Proof: By induction on |X|.

Basis: Clearly true for |X| = 1, for all |Y|.

Induction: Assume true for |X| = l, |Y| = m and prove true for |X| = l+1, |Y| = m. Let $X = \{x_1, x_2, \ldots, x_{l+1}\}$ and
$Y = \{y_1, y_2, \ldots, y_m\}$. Define the subsets $H_i$ of H as follows.

$$H_i = \{ f \mid f \in H,\ f(x_1) = y_i \}.$$

Also, define the sets of functions $H_{ij}$ and $H_0$ as follows.

$$\text{for } i \ne j:\quad H_{ij} = \{ f \mid f \in H_i,\ \exists\, g \in H_j \text{ such that } f = g \text{ on } X - \{x_1\} \},$$

$$H_0 = H - \bigcup_{i \ne j} H_{ij}.$$

Now,

$$|H| = \Big| H_0 \cup \bigcup_{i \ne j} H_{ij} \Big| \le |H_0| + \sum_{i \ne j} |H_{ij}|.$$

We seek bounds on the quantities on the right-hand side of the last inequality. By definition, the functions
in $H_0$ are all distinct on the l elements of $X - \{x_1\}$. Furthermore, the largest set shattered in $H_0$ must be of
cardinality no greater than k. Hence, we have by the inductive hypothesis,

$$|H_0| \le l^{k} m^{2k}.$$

And then, every $H_{ij}$ shatters a set of cardinality at most k-1, as otherwise H would shatter a set of
cardinality greater than k. Also, since the functions in $H_{ij}$ are all distinct on $X - \{x_1\}$, we have by the
inductive hypothesis,

$$\text{for } i \ne j:\quad |H_{ij}| \le l^{k-1} m^{2(k-1)}.$$

Combining the last three inequalities, and noting that there are at most $m^{2}$ pairs (i, j), we have

$$|H| \le l^{k} m^{2k} + m^{2}\, l^{k-1} m^{2(k-1)} = m^{2k}\, l^{k-1} (l+1) \le (l+1)^{k} m^{2k}.$$

Which  completes  the  proof  of  the  claim.  • 
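As a sanity check rather than a proof, the bound of the claim can be tested numerically on small random families. The sketch below assumes |X| = 4 and |Y| = 3, sizes chosen only so that the bound is not vacuous, and represents functions as dicts; all names are illustrative.

```python
import random
from itertools import combinations, product


def is_shattered(H, S):
    """Generalized shattering of the tuple S by H (a list of dicts X -> Y)."""
    patterns = {tuple(e[s] for s in S) for e in H}
    for p, q in combinations(patterns, 2):
        # p and q play the roles of f and g restricted to S.
        if all(a != b for a, b in zip(p, q)) and \
           all(mix in patterns for mix in product(*zip(p, q))):
            return True
    return False


def shattered_size(H, X):
    """Size k of the largest subset of X shattered by H."""
    best = 0
    for k in range(1, len(X) + 1):
        if not any(is_shattered(H, S) for S in combinations(X, k)):
            break  # subsets of shattered sets are shattered, so stop here
        best = k
    return best


def check_claim(trials=30, seed=0):
    """Return True if |H| <= |X|^k |Y|^(2k) held for every sampled family H."""
    rng = random.Random(seed)
    X, Y = (0, 1, 2, 3), (0, 1, 2)
    all_fns = list(product(Y, repeat=len(X)))
    for _ in range(trials):
        size = rng.randint(1, len(all_fns))
        H = [dict(zip(X, values)) for values in rng.sample(all_fns, size)]
        k = shattered_size(H, X)
        if len(H) > len(X) ** k * len(Y) ** (2 * k):
            return False
    return True


if __name__ == "__main__":
    print(check_claim())
```

With these sizes there are 81 functions in all, while the bound for k = 1 is 36, so any sampled family with more than 36 functions must shatter a set of size at least two for the check to pass.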

Returning to the lemma, we have $X = Y = \Sigma^{n}$, and hence $|X|, |Y| \le 2^{n+1}$. If k is the cardinality of the
largest set in $\Sigma^{n}$ shattered by $F_n$, we have by our claim,

$$|F_n| \le (2^{n+1})^{k} (2^{n+1})^{2k} = 2^{k(3n+3)}.$$

Taking logarithms,

$$\log(|F_n|) = \dim(F_n) = d \le k(3n+3).$$

Hence, $k \ge d/(3n+3)$, which is as desired. •

Using  this  lemma,  we  can  prove  the  following  theorem. 

Theorem 5: A family of functions is feasibly learnable if and only if it is of polynomial dimension.

Proof: Similar to the proof of Theorem 1, except that we need to use the generalized notion of
shattering and the corresponding generalized shattering lemma. •

Analogous  to  our  development  of  time-complexity  considerations  for  concept  learning,  we  define  the 
following. 

For a family of functions F of countable cardinality, we define an index I to be a naming scheme for
the functions in F, in a sense identical to that for a family of concepts.

We say a family of functions F is polynomial-time learnable in an index I, if there exists a
deterministic learning algorithm A such that

(a) A takes as input integers n and h.

(b) A runs in time polynomial in the error parameter h, the length parameter n and in the length of
the shortest index in I for the function to be learned, f. A makes polynomially few calls of
EXAMPLE, polynomial in n and h. EXAMPLE returns examples for f, chosen randomly according to
an arbitrary and unknown probability distribution P on $\Sigma^{n}$.

(c) For all functions f in F and all probability distributions P on $\Sigma^{n}$, with probability (1 - 1/h) the
algorithm outputs an index $i_g \in I(g)$ of a function g in F such that

$$\sum_{x:\ f(x) \ne g(x)} P(x) \le 1/h$$

We are interested in identifying the class of pairs (F, I), where F is a family of functions and I is an
index for it, such that F is polynomial-time learnable in I. To this end, we define the following.

Defn: For a family F and index I, an ordering is a program that

(a) takes as input a set of examples $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i), \ldots\}$. Let n be the length of the
longest string among the $x_i$ and $y_i$.

(b) produces as output an index in I of a function $f \in F$ that is consistent with S, if such exists, i.e.,
outputs $i_f \in I(f)$ for some $f \in F$ such that

$$\forall (x, y) \in S:\ y = f_n(x).$$

Furthermore, if the ordering runs in time polynomial in the length of its input and the length of the shortest
such index, we say it is a polynomial-time ordering and F is polynomial-time orderable in I.

With these definitions in hand, we can state the following theorem.

Theorem 6: A family of functions is polynomial-time learnable: (1) if it is of polynomial dimension
and polynomial-time orderable; (2) only if it is of polynomial dimension and is orderable in random
polynomial time.

Proof: Similar to that of Theorem 3. •

6.  Finite Learnability

Thus far we explored the asymptotic learnability of families of sets and functions, that is to say, we
considered the asymptotic variation of the number of examples needed for learning with increasing values
of the size parameter. We will now investigate a different notion of learnability, one that asks whether the
number of examples needed for learning is finite, i.e., varies as a finite-valued function of the error
parameter, without regard to the size parameter. We call this notion of learnability "finite learnability" as
opposed to the notion of asymptotic learnability.

For the case of families of sets, [Blumer et al., 1986] present conditions necessary and sufficient for
finite learnability. Their elegant results rely on the powerful results in classical probability theory of
[Vapnik and Chervonenkis, 1971]. In the following we review their results briefly and then go on to
present learnability results for families of functions, relying in part on the same results of [Vapnik and
Chervonenkis, 1971].

Defn: Let F be a family of sets on $R^k$, where R is the set of reals and k is a fixed natural number.
We say F is finitely learnable if there exists an algorithm A such that

(a) A takes as input an integer h, the error parameter.

(b) A makes finitely many calls of EXAMPLE, although the exact number of calls may depend on h.
EXAMPLE returns examples for some set f in F, where the examples are chosen randomly
according to an arbitrary and unknown probability distribution P on $R^k$.

(c) For all probability distributions P and all sets f in F, with probability (1 - 1/h), A outputs $g \in F$ such that

$$\int_{f \Delta g} dP \le 1/h.$$

The  following  theorem  is  from  [Blumer  et  al.,  1986]. 

Theorem 7: [Blumer et al., 1986] A family of sets F on $R^k$ is finitely learnable if and only if F shatters
only finite subsets of $R^k$. ([Blumer et al., 1986] refer to the size of the largest set shattered by F as the
Vapnik-Chervonenkis dimension of the family F.)

Let us now formalize the notion of finite learnability of families of functions on the reals.

Defn: Let F be a family of functions from $R^k$ to $R^k$, where R is the set of reals and k is a fixed natural
number. We say F is finitely learnable if there exists an algorithm A such that

(a) A takes as input an integer h, the error parameter.

(b) A makes finitely many calls of EXAMPLE, although the exact number of calls may depend on h.
EXAMPLE returns examples for some function f in F, where the examples are chosen randomly
according to an arbitrary and unknown probability distribution P on $R^k$.

(c) For all probability distributions P and all functions f in F, with probability (1 - 1/h), A outputs $g \in F$
such that

$$\int_{f(x) \ne g(x)} dP \le 1/h.$$


We need the following supporting definitions. Let f be a function from $R^k$ to $R^k$. We define the graph
of f, denoted by graph(f), to be the set of all examples for f. That is,

$$graph(f) = \{ (x, y) \mid y = f(x) \}.$$

Clearly, $graph(f) \subseteq R^k \times R^k$. Analogously, for a family of functions F, we define graph(F) to be the set of
graphs for the functions in F. That is,

$$graph(F) = \{ graph(f) \mid f \in F \}.$$

We now state the main theorem of this section. The theorem is not tight in the sense that the
necessary and sufficient conditions do not match. (In [Natarajan, 1988], a tight version of the theorem
was reported, on the basis of an incorrect proof.) Indeed, we will identify a finitely learnable family of
functions that sits in the gap between these conditions.

Theorem 8: A family of functions F from $R^k$ to $R^k$ is finitely learnable

(a) If there exists a bound on the size of the sets in $R^k \times R^k$ shattered by graph(F). (Simple shattering
as defined in Section 2.)

(b) Only if there exists a bound on the size of the sets in $R^k$ shattered by F. (Generalized shattering
as defined in Section 5.)

Proof: (If) This direction of the proof follows from the convergence results of [Vapnik and
Chervonenkis, 1971] exactly as shown in [Blumer et al., 1986]. Essentially, the "if" condition implies that
the family graph(F) is finitely learnable. Whence it follows that the family F is finitely learnable.

(Only if) This direction of the proof is identical to the asymptotic case of Theorem 4, which in turn
followed the arguments of Theorem 1. •

While Theorem 8 is not tight, it appears that tightening it is a rather difficult task. Indeed we
conjecture that the "if" condition should match the "only if" condition as stated below.

Conjecture: A family of functions F from $R^k$ to $R^k$ is finitely learnable if and only if there exists a
bound on the size of the sets in $R^k$ shattered by F.

To give the reader a flavour of the difficulties involved in tightening Theorem 8, we give an example
of a family F of functions that lies in the gap between the necessary and sufficient conditions of Theorem
8, i.e.,

(a) F shatters sets of size at most one.

(b) graph(F) shatters arbitrarily large sets.

(c) F is finitely learnable.


Example: Let N be the natural numbers in binary representation. For any $a \in N$, define the function
$f_a : N \to N$ as follows.

$$f_a(x) = \begin{cases} a, & \text{if the } x^{th} \text{ bit of } a \text{ is } 1 \\ 0, & \text{otherwise} \end{cases}$$

Define the family F as follows.

$$F = \{ f_a \mid a \in N \}.$$

Claim: F shatters sets of size at most one.

Proof: Suppose F shatters a set of size greater than one. Then F must shatter a set of size 2. Let
S = {a, b} be such a set. By definition, there exist three functions f, g, e in F such that $f(a) \ne g(a)$, $f(b) \ne g(b)$
and e(a) = f(a), e(b) = g(b). Since $f(a) \ne g(a)$, at least one of them must be non-zero. Without
loss of generality, assume that f(a) is non-zero. Now, by the definition of the functions in F, $f(a) = e(a) \ne 0$
implies that f = e. This contradicts the assumption that $e(b) = g(b) \ne f(b)$, and hence the claim. •

Claim: graph(F) shatters arbitrarily large sets.

Proof: Let $S_1$ be any arbitrarily large but finite subset of N. Consider $S = S_1 \times \{0\}$. It is easy to see
that graph(F) shatters S, as for any subset $S_2$ of S, there exists a function $f \in F$ such that $graph(f) \cap S = S_2$. To see
this, notice that for any subset $S_2$ of S, we can pick an integer $a \in N$ such that $graph(f_a) \cap S = S_2$: for each
$x \in S_1$, let the $x^{th}$ bit of a be 0 if $(x, 0) \in S_2$ and 1 otherwise. Since $S_1$ was
picked to be arbitrarily large, the claim is proved. •
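The construction in this proof can be checked mechanically for small $S_1$. The sketch below assumes that bit x of a is counted from the least significant end starting at 0, a convention the text does not fix; the function names are illustrative.

```python
from itertools import chain, combinations


def f(a, x):
    """f_a(x) = a if bit x of a is 1, else 0 (bit 0 is least significant: an assumption)."""
    return a if (a >> x) & 1 else 0


def witness(S1, S2):
    """Pick a so that graph(f_a) meets S1 x {0} exactly in S2 x {0}.

    Bit x of a is set to 1 exactly for those x in S1 that are NOT in S2,
    so that f_a(x) = 0 precisely on S2.
    """
    a = 0
    for x in S1:
        if x not in S2:
            a |= 1 << x
    return a


def graph_shatters(S1):
    """Check that every subset of S1 x {0} is cut out by some graph(f_a)."""
    S1 = tuple(S1)
    subsets = chain.from_iterable(combinations(S1, r) for r in range(len(S1) + 1))
    for S2 in subsets:
        a = witness(S1, S2)
        hit = {x for x in S1 if f(a, x) == 0}  # x with (x, 0) in graph(f_a)
        if hit != set(S2):
            return False
    return True


if __name__ == "__main__":
    print(graph_shatters(range(6)))
```

Note that whenever f_a(x) is non-zero it equals a, so a single non-zero value identifies the function; this is the property used in the preceding claim.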

Claim: F is finitely learnable.

Proof: The following is a learning algorithm for F.

Learning Algorithm A
input: h;

begin

    call for h log(h) examples;

    if any of the examples seen is of the
    form (x, y), y ≠ 0
    then output f_y
    else output f_0.
end


It is easy to show that the probabilities work out for algorithm A above. Suppose the function to be
learned were $f_a$, for some $a \ne 0$. Then, if

$$\int_{f_a(x) \ne 0} dP > 1/h,$$

with probability (1 - 1/h), in h log h examples there must be an example of the form (x, a). In which case, the
algorithm will output $f_a$, implying that with probability (1 - 1/h), the algorithm learns the unknown function
exactly. Otherwise, the output $f_0$ already differs from $f_a$ on a set of probability at most 1/h. Hence the claim. •
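A short Python sketch of this learner follows. It assumes natural logarithms, under which the chance of seeing no non-zero example when such examples have probability above 1/h is at most (1 - 1/h)^(h ln h) <= 1/h; the oracle passed in as `example` is a hypothetical stand-in for EXAMPLE.

```python
import math


def learn_name(h, example):
    """Return a, the name of the hypothesis f_a, after about h log h examples."""
    m = math.ceil(h * math.log(h))  # natural logarithm is an assumption
    sample = [example() for _ in range(m)]
    for _, y in sample:
        if y != 0:
            return y  # a non-zero value is the name of the target function
    return 0  # no non-zero value seen: output the base function f_0


if __name__ == "__main__":
    target = 0b1010

    def always_bit_one():
        # Hypothetical degenerate oracle that always returns x = 1.
        x = 1
        return x, (target if (target >> x) & 1 else 0)

    print(learn_name(8, always_bit_one))
```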

The interesting thing about the functions in F is that each function differs from the base function $f_0$ on
finitely many points, and on these points, the value of the function is the name of the function. Hence, if
the learning algorithm sees a non-zero value in an example, it can uniquely identify the function to be
learned. •

Thus far, we considered functions on real spaces, requiring that on a randomly chosen point, with
high probability the learner's approximation agree exactly with the function to be learned. This requires
infinite precision arithmetic and hence is largely of technical interest. But then, if all the computations are
carried out only to some finite precision, Theorem 5 would apply directly. Alternatively, we could require
that the learned function approximate the target function with respect to some predetermined norm. In
the following, we consider the case of the square norm, for a single probability distribution P.

First, we limit the discussion to families of "normalized" functions. Let E(a, b) denote the euclidean
distance between any two points a and b. Let F be a family of functions such that for every $f \in F$
and $x \in R^k$, $E(f(x), 0^k) < 1$, where $0^k$ is the origin in $R^k$. Then, we fix the probability distribution P.

Defn: We say that F is finitely learnable with respect to the square norm and a distribution P on $R^k$,
if there exists an algorithm A such that:

(a) A takes as input an integer h, the error parameter.

(b) A makes finitely many calls of EXAMPLE, though the exact number may depend on h.
EXAMPLE returns examples for some function f in F, where the examples are chosen according
to the distribution P.

(c) For all functions $f \in F$, with probability (1 - 1/h), A outputs a function $g \in F$ such that

$$\int_{x \in R^k} E(f(x), g(x))\, dP \le 1/h.$$

Before  we  can  state  our  result  in  this  setting,  we  need  the  following  definition,  adapted  from 
[Benedeck  and  Itai,  1988]. 


Defn: For small positive $\delta$, $K \subseteq F$ is a $\delta$-cover with respect to the square norm and distribution P if, for
any $f \in F$ there exists $g \in K$ such that

$$\int_{x \in R^k} E(f(x), g(x))\, dP < \delta.$$

Theorem 9: A family of functions is finitely learnable with respect to the square norm and a
distribution P, if and only if for all positive $\delta$, there exists a finite $\delta$-cover for F.


Proof: The details of the proof are identical to that of the main theorem of [Benedeck and Itai,
1988]. A learning algorithm A for F can be described as follows: on input h, A constructs a 1/h-cover of F
of minimum size. A then calls for sufficiently many examples to permit it to pick one of the functions in the
cover with sufficiently high confidence. •
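The selection step can be illustrated as below. The text only says that A picks a function from the cover with sufficiently high confidence; choosing the member with the smallest empirical mean distance to the observed examples is an assumption made for this sketch, as are the function name and the representation of points in $R^k$ as tuples of floats.

```python
import math


def pick_from_cover(cover, sample):
    """Return the member of a finite cover closest, on average, to the sample.

    cover  : list of callables g mapping a point (tuple) to a point (tuple)
    sample : list of examples (x, y) with y = f(x) for the unknown target f
    """
    def empirical_distance(g):
        # Mean euclidean distance E(g(x), f(x)) over the observed examples.
        return sum(math.dist(g(x), y) for x, y in sample) / len(sample)

    return min(cover, key=empirical_distance)


if __name__ == "__main__":
    cover = [lambda x: (0.0,), lambda x: (0.5,)]
    sample = [((0.1,), (0.5,)), ((0.7,), (0.5,))]
    chosen = pick_from_cover(cover, sample)
    print(chosen((0.3,)))
```

How many examples are sufficient depends on the size of the cover and on h; the sketch leaves the sample size to the caller.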